
Conversation

@tharittk (Collaborator) commented Jun 3, 2025

I added NCCL support to MPIVStack. Along the way, I discovered some interesting issues that may be worth discussing, so I am opening this draft PR.

Changes Made
DistributedArray.py

  • The MPIVStack adjoint operation is wrapped by the @reshape decorator, which calls add_ghost_cells() in this file. add_ghost_cells() has to be modified to support NCCL. There are two points I want to raise.
  1. the call to self._allgather(cell_fronts): cell_fronts is metadata and small in size (a list of ints, length = total number of ranks). Under the current implementation, I dispatch to NCCL if NCCL is enabled. Should we enforce always using MPI here instead?
  2. the call to self.base_comm.send: this sends the ghost cells to a peer, and the sent array can be large. Under the current code, it always uses MPI. If we want it to use NCCL, we need to implement point-to-point NCCL; luckily, CuPy NCCL supports this, so I think it is just a matter of adding more calls in _nccl.py (see the sketch after this list).
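A minimal sketch of the kind of dispatch I have in mind for the ghost-cell exchange (illustrative only: the helper name _send_ghost_cells and its signature are hypothetical, and nccl_send stands for the point-to-point wrapper to be added in _nccl.py):

def _send_ghost_cells(self, ghost_cells, dest_rank):
    # metadata (cell_fronts) can stay on MPI; the potentially large
    # ghost-cell arrays are what we may want to route through NCCL
    if self.base_comm_nccl is not None:
        # point-to-point NCCL transfer via the wrapper in _nccl.py
        nccl_send(self.base_comm_nccl, ghost_cells, dest_rank)
    else:
        # existing MPI path
        self.base_comm.send(ghost_cells, dest=dest_rank)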

VStack.py

  • the operator now takes one optional argument, base_comm_nccl - just like what we did in DistributedArray. I did not change the MPILinearOperator interface to take this argument, though; I don't have a strong opinion either way.
  • the output y of Op @ x and Op.H @ x is now initialized with the same base_comm_nccl as x, i.e., if x lives in NCCL, y lives in and communicates with NCCL too (see the short check after this list).
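As a small illustration of that propagation (a hedged sketch; the attribute access assumes the DistributedArray interface introduced here):

y = Op @ x    # forward
z = Op.H @ y  # adjoint
# both results inherit x's NCCL communicator, so any later communication
# (e.g. a ghost-cell exchange in a chained operator) also runs over NCCL
assert y.base_comm_nccl is x.base_comm_nccl
assert z.base_comm_nccl is x.base_comm_nccl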

test_stack_nccl.py

  • this closely follows its MPI counterpart test_stack.py but tests explicitly in an NCCL environment
  • script for testing the HStack operator

@mrava87 (Contributor) commented Jun 3, 2025

@tharittk, very good start!

I am going to reply to some of your comments/questions and will look more closely at your code in the next few days.

  • self._allgather(cell_fronts): I may have said otherwise on Slack, but I agree that this is only dealing with indices, so I would leave it to MPI even when we use NCCL, as we did for other similar operations in the DistributedArray PR.
  • send/recv: I agree that we should follow the same approach, have their implementation in the _nccl file, and then dispatch to the correct one based on whether base_comm_nccl is present or not... and yes, we should definitely allow this to use NCCL if we have a base_comm_nccl, because this could be sending larger arrays and so become a bottleneck that NCCL can speed up.
  • I also don't have strong feelings about whether we should add base_comm_nccl to MPILinearOperator or not... we need the MPI base comm to get rank and size, but other than that we don't use it, so it is probably not worth also passing and storing the NCCL communicator. We will, however, need to change MPILinearOperator a bit (here or in a later PR), as MPILinearOperator is used to wrap PyLops operators that we want to be applied identically on every rank to a DistributedArray with BROADCAST partition. So we will need to modify how we create y (currently y = DistributedArray(global_shape=self.shape[0], ...)), passing base_comm_nccl=x.base_comm_nccl to ensure that it is not lost in y in case the next operator in the chain has some form of communication (say, e.g., we apply a FirstDerivative); there, I think it would be good for the input DistributedArray to carry the correct base_comm_nccl (in case we end up in some situation where we want to do some checks)... a sketch of that creation follows below.
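For concreteness, a hedged sketch of what that modified creation could look like (keyword names other than global_shape and base_comm_nccl are assumptions here, not the exact MPILinearOperator code):

y = DistributedArray(global_shape=self.shape[0],
                     base_comm_nccl=x.base_comm_nccl,  # carried over from the input
                     partition=Partition.BROADCAST,    # operator applied identically on every rank
                     engine=x.engine,
                     dtype=self.dtype)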

@tharittk (Collaborator, Author) commented Jun 4, 2025

I see. For the third point about passing base_comm_nccl=x.base_comm_nccl: currently I test by calling Op @ x, which calls the concrete instance of MPIVStack, and thus nccl_comm is passed (even though I did not change the MPILinearOperator interface). So when I check which collective calls y operates with, it says NCCL.

If somehow Op @ x is called and Op is an instance of MPILinearOperator, nccl_comm is lost because y then takes the default value of base_comm_nccl=None. That is something I did not catch.

@tharittk marked this pull request as ready for review on June 8, 2025
@tharittk (Collaborator, Author) commented Jun 8, 2025

The most recent commits reflect some changes I want to point out:

  • Removal of the base_comm_nccl argument from the concrete MPILinearOperator class. The argument seems unnecessary, and that information can be taken from the operand x whenever _matvec or _rmatvec is called.
  • Addition of nccl_send and nccl_recv. To test this implementation, the MPIVStack test case is not enough: it requires a test where cell_fronts is not zero, i.e., we are not sending an empty buffer. So I added test_blockdiag_nccl.py. In this file, the StackedBDiag has a FirstDerivative inside it; this first derivative triggers nccl_send and nccl_recv with meaningful (non-empty) ghost cells. A sketch of these wrappers follows below.
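A rough sketch of what these wrappers might look like on top of CuPy's NCCL bindings (the names nccl_send/nccl_recv are the ones added in this PR, but the bodies below are illustrative and assume float32 buffers and the default stream):

import cupy as cp
from cupy.cuda import nccl


def nccl_send(nccl_comm, send_buf, dest):
    # point-to-point send of a contiguous CuPy array to rank `dest`
    nccl_comm.send(send_buf.data.ptr, send_buf.size, nccl.NCCL_FLOAT32,
                   dest, cp.cuda.Stream.null.ptr)


def nccl_recv(nccl_comm, recv_buf, source):
    # receive into a preallocated CuPy array from rank `source`
    nccl_comm.recv(recv_buf.data.ptr, recv_buf.size, nccl.NCCL_FLOAT32,
                   source, cp.cuda.Stream.null.ptr)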

@mrava87 (Contributor) left a comment

@tharittk very good!

Everything makes sense to me and I agree with the code changes. I just left some minor suggestions. After you have taken those into account, I think we can merge this PR 😄

One more minor thing: as you progress, don't forget to keep the table in gpu.rst up to date.

local_shapes = None
global_shape = getattr(self, "dims")
arr = DistributedArray(global_shape=global_shape,
base_comm_nccl=x.base_comm_nccl,
Contributor commented:

Since we are changing this, I think it would be safe to also pass base_comm=x.base_comm... I think in the past this never led to any issue as we probably always used MPI.COMM_WORLD, but it's good not to assume this will always be the case 😄 (@rohanbabbar04, agree?)

Collaborator commented:

Suggested change
- base_comm_nccl=x.base_comm_nccl,
+ base_comm=x.base_comm,
+ base_comm_nccl=x.base_comm_nccl,

__all__ = [
"initialize_nccl_comm",
"nccl_split",
"nccl_allgather",
Contributor commented:

I think it may be good to add all nccl_* methods to the Utils section of https://github.com/PyLops/pylops-mpi/blob/main/docs/source/api/index.rst

Collaborator (Author) commented:

Alright, I can do that. Maybe in another PR?

Contributor commented:

Sounds good. If it is very small like this, we can go for the same PR; if it is something a bit more substantial, like the changes you made previously, it is good practice to have a separate documentation-only PR 😄

@tharittk (Collaborator, Author) commented:

Since the PR involves adding NCCL support to _send and _recv, which directly impact add_ghost_cells, I decided to also change FirstDerivative and SecondDerivative so that they support NCCL. This implicitly makes Gradient and Laplacian work as well - those two require no code changes. A usage sketch follows below.
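A hypothetical end-to-end usage sketch of what this enables (module paths, the initialize_nccl_comm helper location, and the MPIFirstDerivative arguments are assumptions and may differ from the actual code):

import cupy as cp
import pylops_mpi
from pylops_mpi.utils._nccl import initialize_nccl_comm  # import path assumed

nccl_comm = initialize_nccl_comm()

n = 100
x = pylops_mpi.DistributedArray(global_shape=n,
                                base_comm_nccl=nccl_comm,
                                engine="cupy")
x[:] = cp.arange(x.local_shape[0], dtype="float64")

# the ghost-cell exchange inside the derivative now runs through
# nccl_send/nccl_recv rather than MPI send/recv
Dop = pylops_mpi.MPIFirstDerivative(dims=(n,), dtype="float64")
y = Dop @ x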

@tharittk changed the title from "support nccl in add_ghost_cells and NCCL-VStack" to "add NCCL support to add_ghost_cells and operators in /basicoperators" on Jun 10, 2025
@mrava87 (Contributor) left a comment

@tharittk I have reviewed the new additions and they look great to me!

There are still a few conversations to resolve (for one of them we can hopefully get @rohanbabbar04's opinion), and I would like to hear from @rohanbabbar04 if he has any general comments; after that I will merge.

@rohanbabbar04 (Collaborator) commented Jun 14, 2025

Sorry, I missed your messages. I’ll review the PR in a day or two and share my comments.

@rohanbabbar04 (Collaborator) left a comment

In all cases, I would use base_comm = x.base_comm along with base_comm_nccl = x.base_comm_nccl everywhere, since we handle both cases. This should not cause any issues, as base_comm changes to MPI.COMM_WORLD when base_comm_nccl is not None.

local_shapes = None
global_shape = getattr(self, "dims")
arr = DistributedArray(global_shape=global_shape,
base_comm_nccl=x.base_comm_nccl,
Collaborator commented:

Suggested change
- base_comm_nccl=x.base_comm_nccl,
+ base_comm=x.base_comm,
+ base_comm_nccl=x.base_comm_nccl,

@tharittk (Collaborator, Author) commented:

Thanks for the review @rohanbabbar04 @mrava87!
I have pushed the latest commit to reflect the suggested changes.

@mrava87 (Contributor) commented Jun 17, 2025

Great, I am going to merge this. @tharittk great work 😄

@mrava87 merged commit 9e25d8e into PyLops:main on Jun 17, 2025
61 checks passed